{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a52c80c4-1ea2-4d1e-b582-fac51081e76d",
   "metadata": {},
   "source": [
    "<center><img src=https://raw.githubusercontent.com/feast-dev/feast/master/docs/assets/feast_logo.png width=400/></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "576a8e30-fe4c-4eda-bc56-9edd7fde3385",
   "metadata": {},
   "source": [
    "# Credit Risk Data Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f3fbd5a-1587-4b4e-9263-a57490657337",
   "metadata": {},
   "source": [
    "Predicting credit risk is an important task for financial institutions. If a bank can accurately estimate the probability that a borrower will repay a future loan, it can make better decisions on loan terms and approvals. Getting credit risk right is critical to offering good financial services, and getting it wrong could mean going out of business.\n",
    "\n",
    "AI models have played a central role in modern credit risk assessment systems. In this example, we develop a credit risk model to predict whether a future loan will be good or bad, given some context data (presumably supplied from the loan application). We use the modeling process to demonstrate how Feast can be used to facilitate the serving of data for training and inference use-cases.\n",
    "\n",
    "In this notebook, we prepare the data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d05715f-ddb8-42de-8f0c-212dcbad9e0e",
   "metadata": {},
   "source": [
    "### Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fba29f9-db1f-4ceb-b066-5b2df2c95d33",
   "metadata": {},
   "source": [
    "*The following code assumes that you have read the example README.md file and have set up an environment where the code can be run. Please make sure you have met the prerequisites.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8a897b19-6f82-4631-ae51-8a23182ff267",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import Python libraries\n",
    "import os\n",
    "import warnings\n",
    "import datetime as dt\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.datasets import fetch_openml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b944ed48-54b3-43fa-8373-ce788d7e71af",
   "metadata": {},
   "outputs": [],
   "source": [
    "# suppress warning messages for example flow (don't run if you want to see warnings)\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "70788c73-144f-4ecf-b370-c5669c538d93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Seed for reproducibility\n",
    "SEED = 142"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfb4dfd0-f583-4aa0-bd39-3ff9fbb80db0",
   "metadata": {},
   "source": [
    "### Pull the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c206dfc-d551-4002-ae63-ccbb981768fa",
   "metadata": {},
   "source": [
    "The data we will use to train the model is from the [OpenML](https://www.openml.org/) dataset [credit-g](https://www.openml.org/search?type=data&sort=runs&status=active&id=31), obtained from a 1994 German study. More details on the data can be found in the `DESCR` attribute and `details` map (see below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "31a9e964-bdb3-4ae4-b2b4-64bbe0ab93a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = fetch_openml(name=\"credit-g\", version=1, parser='auto')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "58dbf7c2-f40b-4965-baac-6903a27ef622",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "**Author**: Dr. Hans Hofmann  \n",
      "**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    \n",
      "**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)\n",
      "\n",
      "**German Credit dataset**  \n",
      "This dataset classifies people described by a set of attributes as good or bad credit risks.\n",
      "\n",
      "This dataset comes with a cost matrix: \n",
      "``` \n",
      "Good  Bad (predicted)  \n",
      "Good   0    1   (actual)  \n",
      "Bad    5    0  \n",
      "```\n",
      "\n",
      "It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  \n",
      "\n",
      "### Attribute description  \n",
      "\n",
      "1. Status of existing checking account, in Deutsche Mark.  \n",
      "2. Duration in months  \n",
      "3. Credit history (credits taken, paid back duly, delays, critical accounts)  \n",
      "4. Purpose of the credit (car, television,...)  \n",
      "5. Credit amount  \n",
      "6. Status of savings account/bonds, in Deutsche Mark.  \n",
      "7. Present employment, in number of years.  \n",
      "8. Installment rate in percentage of disposable income  \n",
      "9. Personal status (married, single,...) and sex  \n",
      "10. Other debtors / guarantors  \n",
      "11. Present residence since X years  \n",
      "12. Property (e.g. real estate)  \n",
      "13. Age in years  \n",
      "14. Other installment plans (banks, stores)  \n",
      "15. Housing (rent, own,...)  \n",
      "16. Number of existing credits at this bank  \n",
      "17. Job  \n",
      "18. Number of people being liable to provide maintenance for  \n",
      "19. Telephone (yes,no)  \n",
      "20. Foreign worker (yes,no)\n",
      "\n",
      "Downloaded from openml.org.\n"
     ]
    }
   ],
   "source": [
    "print(data.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "53de57ec-0fb6-4b51-9c27-696b059a1847",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original data url:   https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)\n",
      "Paper url:           https://dl.acm.org/doi/abs/10.1145/967900.968104\n"
     ]
    }
   ],
   "source": [
    "print(\"Original data url: \".ljust(20), data.details[\"original_data_url\"])\n",
    "print(\"Paper url: \".ljust(20), data.details[\"paper_url\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b2c2514-484e-46cb-aedc-89a301266f44",
   "metadata": {},
   "source": [
    "### High-Level Data Inspection"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a76af306-caba-403d-a9cb-b5de12573075",
   "metadata": {},
   "source": [
    "Let's inspect the data to see high-level details like data types and size. We also want to make sure there are no glaring issues (like a large number of null values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "20fb82c4-ed8d-42f8-b386-c7ebdc9bf786",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 1000 entries, 0 to 999\n",
      "Data columns (total 21 columns):\n",
      " #   Column                  Non-Null Count  Dtype   \n",
      "---  ------                  --------------  -----   \n",
      " 0   checking_status         1000 non-null   category\n",
      " 1   duration                1000 non-null   int64   \n",
      " 2   credit_history          1000 non-null   category\n",
      " 3   purpose                 1000 non-null   category\n",
      " 4   credit_amount           1000 non-null   int64   \n",
      " 5   savings_status          1000 non-null   category\n",
      " 6   employment              1000 non-null   category\n",
      " 7   installment_commitment  1000 non-null   int64   \n",
      " 8   personal_status         1000 non-null   category\n",
      " 9   other_parties           1000 non-null   category\n",
      " 10  residence_since         1000 non-null   int64   \n",
      " 11  property_magnitude      1000 non-null   category\n",
      " 12  age                     1000 non-null   int64   \n",
      " 13  other_payment_plans     1000 non-null   category\n",
      " 14  housing                 1000 non-null   category\n",
      " 15  existing_credits        1000 non-null   int64   \n",
      " 16  job                     1000 non-null   category\n",
      " 17  num_dependents          1000 non-null   int64   \n",
      " 18  own_telephone           1000 non-null   category\n",
      " 19  foreign_worker          1000 non-null   category\n",
      " 20  class                   1000 non-null   category\n",
      "dtypes: category(14), int64(7)\n",
      "memory usage: 71.0 KB\n"
     ]
    }
   ],
   "source": [
    "df = data.frame\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a384932a-40df-45f6-bfbc-a9cf6c708f1b",
   "metadata": {},
   "source": [
    "We see that there are 21 columns, each with 1000 non-null values. The first 20 columns are contextual fields with `Dtype` of `category` or `int64`, while the last field is actually the target variable, `class`, which we wish to predict. \n",
    "\n",
    "From the description (above), the `class` tells us whether a loan to a customer was \"good\" or \"bad\". We are anticipating that patterns in the contextual data, as well as their relationship to the class outcomes, can give insight into loan classification. In the following notebooks, we will build a loan classification model that seeks to encode these patterns and relationships in its weights, such that given a new loan application (context data), the model can predict whether the loan (if approved) will be good or bad in the future."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a451c9a3-0390-4d5a-b687-c59f52445eb1",
   "metadata": {},
   "source": [
    "### Data Preparation For Demonstrating Feast"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc4e7653-b118-44c3-ade3-f1b217b112fc",
   "metadata": {},
   "source": [
    "At this point, it's important to note that Feast was developed primarily to work with production data. Feast requires datasets to have entities (in our case, IDs) and timestamps, which it uses in joins. Feast can support joining data on multiple entities (like primary keys in SQL), as well as \"created\" timestamps and \"event\" timestamps. However, in this example, we'll keep things simpler.\n",
    "\n",
    "In a real loan application scenario, the application fields (in a database) would be associated with a timestamp, while the actual loan outcome (label) would be determined much later and recorded separately with a different timestamp.\n",
    "\n",
    "In order to demonstrate Feast capabilities, such as point-in-time joins, we will mock IDs and timestamps for this data. For IDs, we will use the original dataframe index values. For the timestamps, we will generate random values in the 15 days between 12:00 on September 24, 2023 and 12:00 on October 9, 2023."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9d6ec4f6-9410-4858-a440-45dccaa0896b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make index into \"ID\" column\n",
    "df = df.reset_index(names=[\"ID\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "055f2cb7-3abf-4d01-be60-e4c7b8ad1988",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add mock timestamps\n",
    "time_format = \"%a %b %d %H:%M:%S %Y\"\n",
    "date = dt.datetime.strptime(\"Mon Oct  9 12:00:00 2023\", time_format)\n",
    "end = int(date.timestamp())\n",
    "start = int((date - dt.timedelta(days=15)).timestamp())  # 'Sun Sep 24 12:00:00 2023'\n",
    "\n",
    "def make_tstamp(date):\n",
    "    dtime = dt.datetime.fromtimestamp(date).ctime()\n",
    "    return dtime\n",
    "    \n",
    "# (seed set for reproducibility)\n",
    "np.random.seed(SEED)\n",
    "df[\"application_timestamp\"] = pd.to_datetime([\n",
    "    make_tstamp(d) for d in np.random.randint(start, end, len(df))\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7800ea9-de9a-4aab-9d77-c4276e7db5f9",
   "metadata": {},
   "source": [
    "Verify that the newly created \"ID\" and \"application_timestamp\" fields were added to the data as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9516fc5c-7c25-4e60-acba-7400ab6bab42",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>checking_status</th>\n",
       "      <td>&lt;0</td>\n",
       "      <td>0&lt;=X&lt;200</td>\n",
       "      <td>no checking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>duration</th>\n",
       "      <td>6</td>\n",
       "      <td>48</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>credit_history</th>\n",
       "      <td>critical/other existing credit</td>\n",
       "      <td>existing paid</td>\n",
       "      <td>critical/other existing credit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>purpose</th>\n",
       "      <td>radio/tv</td>\n",
       "      <td>radio/tv</td>\n",
       "      <td>education</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>credit_amount</th>\n",
       "      <td>1169</td>\n",
       "      <td>5951</td>\n",
       "      <td>2096</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>savings_status</th>\n",
       "      <td>no known savings</td>\n",
       "      <td>&lt;100</td>\n",
       "      <td>&lt;100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment</th>\n",
       "      <td>&gt;=7</td>\n",
       "      <td>1&lt;=X&lt;4</td>\n",
       "      <td>4&lt;=X&lt;7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>installment_commitment</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>personal_status</th>\n",
       "      <td>male single</td>\n",
       "      <td>female div/dep/mar</td>\n",
       "      <td>male single</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>other_parties</th>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>residence_since</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>property_magnitude</th>\n",
       "      <td>real estate</td>\n",
       "      <td>real estate</td>\n",
       "      <td>real estate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>age</th>\n",
       "      <td>67</td>\n",
       "      <td>22</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>other_payment_plans</th>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>housing</th>\n",
       "      <td>own</td>\n",
       "      <td>own</td>\n",
       "      <td>own</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>existing_credits</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>job</th>\n",
       "      <td>skilled</td>\n",
       "      <td>skilled</td>\n",
       "      <td>unskilled resident</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num_dependents</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>own_telephone</th>\n",
       "      <td>yes</td>\n",
       "      <td>none</td>\n",
       "      <td>none</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>foreign_worker</th>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>class</th>\n",
       "      <td>good</td>\n",
       "      <td>bad</td>\n",
       "      <td>good</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>application_timestamp</th>\n",
       "      <td>2023-10-04 17:50:13</td>\n",
       "      <td>2023-09-28 18:10:13</td>\n",
       "      <td>2023-10-03 23:06:03</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                     0                    1  \\\n",
       "ID                                                   0                    1   \n",
       "checking_status                                     <0             0<=X<200   \n",
       "duration                                             6                   48   \n",
       "credit_history          critical/other existing credit        existing paid   \n",
       "purpose                                       radio/tv             radio/tv   \n",
       "credit_amount                                     1169                 5951   \n",
       "savings_status                        no known savings                 <100   \n",
       "employment                                         >=7               1<=X<4   \n",
       "installment_commitment                               4                    2   \n",
       "personal_status                            male single   female div/dep/mar   \n",
       "other_parties                                     none                 none   \n",
       "residence_since                                      4                    2   \n",
       "property_magnitude                         real estate          real estate   \n",
       "age                                                 67                   22   \n",
       "other_payment_plans                               none                 none   \n",
       "housing                                            own                  own   \n",
       "existing_credits                                     2                    1   \n",
       "job                                            skilled              skilled   \n",
       "num_dependents                                       1                    1   \n",
       "own_telephone                                      yes                 none   \n",
       "foreign_worker                                     yes                  yes   \n",
       "class                                             good                  bad   \n",
       "application_timestamp              2023-10-04 17:50:13  2023-09-28 18:10:13   \n",
       "\n",
       "                                                     2  \n",
       "ID                                                   2  \n",
       "checking_status                            no checking  \n",
       "duration                                            12  \n",
       "credit_history          critical/other existing credit  \n",
       "purpose                                      education  \n",
       "credit_amount                                     2096  \n",
       "savings_status                                    <100  \n",
       "employment                                      4<=X<7  \n",
       "installment_commitment                               2  \n",
       "personal_status                            male single  \n",
       "other_parties                                     none  \n",
       "residence_since                                      3  \n",
       "property_magnitude                         real estate  \n",
       "age                                                 49  \n",
       "other_payment_plans                               none  \n",
       "housing                                            own  \n",
       "existing_credits                                     1  \n",
       "job                                 unskilled resident  \n",
       "num_dependents                                       2  \n",
       "own_telephone                                     none  \n",
       "foreign_worker                                     yes  \n",
       "class                                             good  \n",
       "application_timestamp              2023-10-03 23:06:03  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check data (first few records, transposed for readability)\n",
    "df.head(3).T"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72b2105a-b459-4715-aa53-6fe69fc4a210",
   "metadata": {},
   "source": [
    "We'll also generate counterpart IDs and timestamps on the label data. In a real-life scenario, the label data would arrive separately from, and later than, the loan application data. To mimic this, let's create a labels dataset with an \"outcome_timestamp\" column that lags the application timestamp by a variable 30 to 90 days."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e214478b-ed9b-4354-ba6f-4117813c56c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add (lagged) label timestamps (30 to 90 days)\n",
    "def lag_delta(data, seed):\n",
    "    np.random.seed(seed)\n",
    "    delta_days = np.random.randint(30, 90, len(data))\n",
    "    delta_hours = np.random.randint(0, 24, len(data))\n",
    "    delta = np.array([dt.timedelta(days=int(delta_days[i]), hours=int(delta_hours[i])) for i in range(len(data))])\n",
    "    return delta\n",
    "\n",
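    "# Copy the selection so adding a column does not trigger pandas' SettingWithCopyWarning\n",
    "labels = df[[\"ID\", \"class\"]].copy()\n",
    "labels[\"outcome_timestamp\"] = pd.to_datetime(df.application_timestamp + lag_delta(df, SEED))"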
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "356a7225-db20-4c15-87a3-4a0eb3127475",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>class</th>\n",
       "      <th>outcome_timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>good</td>\n",
       "      <td>2023-11-24 22:50:13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>bad</td>\n",
       "      <td>2023-11-03 12:10:13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>good</td>\n",
       "      <td>2023-11-30 22:06:03</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID class   outcome_timestamp\n",
       "0   0  good 2023-11-24 22:50:13\n",
       "1   1   bad 2023-11-03 12:10:13\n",
       "2   2  good 2023-11-30 22:06:03"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check labels\n",
    "labels.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a29f754-f758-402b-ac42-2dcfcee3b7fc",
   "metadata": {},
   "source": [
    "You can verify that each \"outcome_timestamp\" falls 30 to 90 days after the corresponding \"application_timestamp\" (above)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e720ce24-e092-4fcd-be3e-68bb18f4d2a7",
   "metadata": {},
   "source": [
    "### Save Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cae0578-8431-46c7-8d64-e52146f47d46",
   "metadata": {},
   "source": [
    "Now that we have our data prepared, let's save it to local parquet files in the `data` directory (parquet is one of the file formats supported by Feast).\n",
    "\n",
    "One more step we will add is splitting the context data column-wise and saving it in two files. This step is contrived--we don't usually split data when we don't need to--but it will allow us to demonstrate later how Feast can easily join datasets (a common need in Data Science projects)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "cebef56c-1f54-4d31-a545-75d708d38579",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the data directory if it doesn't exist\n",
    "os.makedirs(\"Feature_Store/data\", exist_ok=True)\n",
    "\n",
    "# Split columns and save context data\n",
    "a_cols = [\n",
    "    'ID', 'checking_status', 'duration', 'credit_history', 'purpose',\n",
    "    'credit_amount', 'savings_status', 'employment', 'application_timestamp',\n",
    "    'installment_commitment', 'personal_status', 'other_parties',\n",
    "]\n",
    "b_cols = [\n",
    "    'ID', 'residence_since', 'property_magnitude', 'age', 'other_payment_plans',\n",
    "    'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',\n",
    "    'foreign_worker', 'application_timestamp'\n",
    "]\n",
    "\n",
    "df[a_cols].to_parquet(\"Feature_Store/data/data_a.parquet\", engine=\"pyarrow\")\n",
    "df[b_cols].to_parquet(\"Feature_Store/data/data_b.parquet\", engine=\"pyarrow\")\n",
    "\n",
    "# Save label data\n",
    "labels.to_parquet(\"Feature_Store/data/labels.parquet\", engine=\"pyarrow\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8d5de9f-bd27-4e95-802c-b121743dd1b0",
   "metadata": {},
   "source": [
    "We have saved the following files to the `Feature_Store/data` directory: \n",
    "- `data_a.parquet` (context data, `a_cols` columns)\n",
    "- `data_b.parquet` (context data, `b_cols` columns)\n",
    "- `labels.parquet` (label outcomes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af6355dc-ff5b-4b3f-b0bd-3c4020ef67e8",
   "metadata": {},
   "source": [
    "With the feature data prepared, we are ready to set up and deploy the feature store. \n",
    "\n",
    "Continue with the [02_Deploying_the_Feature_Store.ipynb](02_Deploying_the_Feature_Store.ipynb) notebook."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
